
Heuristics for identification of bibliographic elements from title pages

Internal identifier: 001525 (Main/Exploration); previous: 001524; next: 001526

Authors: Durga Sankar Rath [India]; A. R. D. Prasad [India]

Source:

Library Hi Tech [ 0737-8831 ] ; 2004-12-01.

RBID : ISTEX:444D56D27EBF7681527E9F282D508A59D2646702

Abstract

This paper presents a methodology for automatically identifying bibliographic data elements from the title pages of books. It also enumerates the steps involved: scanning the title pages, running optical character recognition (OCR) software, generating HTML files from the title pages, and applying heuristics to identify the bibliographic data elements. Much of the paper deals with the surveys undertaken to analyze the characteristics of bibliographic descriptive elements such as title, author, and publisher. The first survey examines the sequence of the bibliographic data on the title pages. The second survey examines the font size, font type, and proximity of each bibliographic element on the title pages. The survey results are then used to develop heuristics for a rule-based expert system that can identify the bibliographic elements on title pages. The results of the system are presented, along with the problems encountered.
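
The abstract does not state the actual rules, so the following is only a minimal sketch of what a font-size and position heuristic of this kind might look like. The `Line` type, the two rules (largest font is the title, last non-empty line is the publisher) and the sample page are all hypothetical illustrations, not the authors' system or data.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Line:
    """One text line recovered from an OCR-generated HTML title page (hypothetical)."""
    text: str
    font_size: int  # point size reported for the line
    position: int   # 0-based order of the line on the page


def guess_title(lines: List[Line]) -> Optional[Line]:
    """Assumed rule: the title is the line set in the largest font.

    Ties are broken in favour of the line that appears first on the page.
    """
    if not lines:
        return None
    return min(lines, key=lambda line: (-line.font_size, line.position))


def guess_publisher(lines: List[Line]) -> Optional[Line]:
    """Assumed rule: the publisher is the last non-empty line on the page."""
    non_empty = [line for line in lines if line.text.strip()]
    if not non_empty:
        return None
    return max(non_empty, key=lambda line: line.position)


# Invented sample page, for illustration only.
page = [
    Line("Introduction to Cataloguing", 24, 0),
    Line("A. N. Author", 14, 1),
    Line("Example Press", 12, 2),
]

title = guess_title(page)
publisher = guess_publisher(page)
print(title.text if title else None)
print(publisher.text if publisher else None)
```

A real rule base would presumably combine several such cues (sequence, font type, proximity), as the two surveys described in the abstract suggest.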

URL:

https://api.istex.fr/document/444D56D27EBF7681527E9F282D508A59D2646702/fulltext/pdf

DOI: 10.1108/07378830410570494

Affiliations:

India

Links toward previous steps (curation, corpus...)

to stream Istex, to step Corpus: 000C65
to stream Istex, to step Curation: 000C46
to stream Istex, to step Checkpoint: 000D60
to stream Main, to step Merge: 001576
to stream Main, to step Curation: 001525

The document in XML format

<record><TEI wicri:istexFullTextTei="biblStruct"><teiHeader><fileDesc><titleStmt><title xml:lang="en">Heuristics for identification of bibliographic elements from title pages</title>
<author><name sortKey="Sankar Rath, Durga" sort="Sankar Rath, Durga" uniqKey="Sankar Rath D" first="Durga" last="Sankar Rath">Durga Sankar Rath</name>
</author>
<author><name sortKey="Prasad, A R D" sort="Prasad, A R D" uniqKey="Prasad A" first="A. R. D." last="Prasad">A. R. D. Prasad</name>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">ISTEX</idno>
<idno type="RBID">ISTEX:444D56D27EBF7681527E9F282D508A59D2646702</idno>
<date when="2004" year="2004">2004</date>
<idno type="doi">10.1108/07378830410570494</idno>
<idno type="url">https://api.istex.fr/document/444D56D27EBF7681527E9F282D508A59D2646702/fulltext/pdf</idno>
<idno type="wicri:Area/Istex/Corpus">000C65</idno>
<idno type="wicri:Area/Istex/Curation">000C46</idno>
<idno type="wicri:Area/Istex/Checkpoint">000D60</idno>
<idno type="wicri:doubleKey">0737-8831:2004:Sankar Rath D:heuristics:for:identification</idno>
<idno type="wicri:Area/Main/Merge">001576</idno>
<idno type="wicri:Area/Main/Curation">001525</idno>
<idno type="wicri:Area/Main/Exploration">001525</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title level="a" type="main" xml:lang="en">Heuristics for identification of bibliographic elements from title pages</title>
<author><name sortKey="Sankar Rath, Durga" sort="Sankar Rath, Durga" uniqKey="Sankar Rath D" first="Durga" last="Sankar Rath">Durga Sankar Rath</name>
<affiliation wicri:level="1"><country xml:lang="fr">Inde</country>
<wicri:regionArea>Lecturer in the Department of Library and Information Science, Ravindra Bharati University, Kolkata</wicri:regionArea>
<wicri:noRegion>Kolkata</wicri:noRegion>
</affiliation>
</author>
<author><name sortKey="Prasad, A R D" sort="Prasad, A R D" uniqKey="Prasad A" first="A. R. D." last="Prasad">A. R. D. Prasad</name>
<affiliation wicri:level="1"><country xml:lang="fr">Inde</country>
<wicri:regionArea>Associate Professor, Documentation Research and Training Centre, Indian Statistical Institute, Bangalore, Karnataka</wicri:regionArea>
<wicri:noRegion>Karnataka</wicri:noRegion>
</affiliation>
</author>
</analytic>
<monogr></monogr>
<series><title level="j">Library Hi Tech</title>
<idno type="ISSN">0737-8831</idno>
<imprint><publisher>Emerald Group Publishing Limited</publisher>
<date type="published" when="2004-12-01">2004-12-01</date>
<biblScope unit="volume">22</biblScope>
<biblScope unit="issue">4</biblScope>
<biblScope unit="page" from="389">389</biblScope>
<biblScope unit="page" to="396">396</biblScope>
</imprint>
<idno type="ISSN">0737-8831</idno>
</series>
<idno type="istex">444D56D27EBF7681527E9F282D508A59D2646702</idno>
<idno type="DOI">10.1108/07378830410570494</idno>
<idno type="filenameID">2380220408</idno>
<idno type="original-pdf">2380220408.pdf</idno>
<idno type="href">07378830410570494.pdf</idno>
</biblStruct>
</sourceDesc>
<seriesStmt><idno type="ISSN">0737-8831</idno>
</seriesStmt>
</fileDesc>
<profileDesc><textClass></textClass>
<langUsage><language ident="en">en</language>
</langUsage>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">This paper presents a methodology for automatic identification of bibliographic data elements from the title pages of books. Also enumerates the various steps like scanning the title pages, running optical character recognition OCR software, generating HTML files out of title pages and applying heuristics to identify the bibliographic data elements. Much of the paper deals with the surveys undertaken to analyze the characteristics of various bibliographic descriptive elements like title, author, publisher and other elements. The first survey deals with the sequence of the bibliographic data in the title pages. The second survey deals with the font size, font type and the proximity of each bibliographic element on the title pages. The survey results are then used to develop heuristics, in order to develop a rulebased expert system which can identify the bibliographic elements on the title pages. The results of the system are presented, along with problems encountered.</div>
</front>
</TEI>
<affiliations><list><country><li>Inde</li>
</country>
</list>
<tree><country name="Inde"><noRegion><name sortKey="Sankar Rath, Durga" sort="Sankar Rath, Durga" uniqKey="Sankar Rath D" first="Durga" last="Sankar Rath">Durga Sankar Rath</name>
</noRegion>
<name sortKey="Prasad, A R D" sort="Prasad, A R D" uniqKey="Prasad A" first="A. R. D." last="Prasad">A. R. D. Prasad</name>
</country>
</tree>
</affiliations>
</record>
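
As a rough illustration of how the citation fields in this record could be read programmatically, here is a sketch using Python's standard `xml.etree.ElementTree`. It parses only the `<series>` fragment copied from the record above. The full record as displayed uses the `wicri:` prefix without a visible namespace declaration, so a namespace-aware parser may reject it unless a declaration is supplied. The function name `citation_fields` and the output dictionary keys are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Fragment copied from the record above (the <series> element only).
FRAGMENT = """
<series><title level="j">Library Hi Tech</title>
<idno type="ISSN">0737-8831</idno>
<imprint><publisher>Emerald Group Publishing Limited</publisher>
<date type="published" when="2004-12-01">2004-12-01</date>
<biblScope unit="volume">22</biblScope>
<biblScope unit="issue">4</biblScope>
<biblScope unit="page" from="389">389</biblScope>
<biblScope unit="page" to="396">396</biblScope>
</imprint>
<idno type="ISSN">0737-8831</idno>
</series>
"""


def citation_fields(xml_text: str) -> dict:
    """Collect journal-level citation fields from a <series> fragment."""
    root = ET.fromstring(xml_text.strip())
    fields = {
        "journal": root.findtext("title"),
        "issn": root.findtext("idno[@type='ISSN']"),
        "publisher": root.findtext("imprint/publisher"),
    }
    date = root.find("imprint/date")
    if date is not None:
        fields["date"] = date.get("when")
    for scope in root.iterfind("imprint/biblScope"):
        unit = scope.get("unit")
        if unit == "page":
            # The record splits the page range across two elements.
            if scope.get("from"):
                fields["first_page"] = scope.get("from")
            if scope.get("to"):
                fields["last_page"] = scope.get("to")
        elif unit:
            fields[unit] = scope.text
    return fields


fields = citation_fields(FRAGMENT)
for key, value in fields.items():
    print(f"{key}: {value}")
```

Reading the page range from the `from` and `to` attributes, instead of from the element text, keeps the two `biblScope unit="page"` elements distinguishable.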

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Exploration

HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 001525 | SxmlIndent | more

HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 001525 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    Main
   |étape=   Exploration
   |type=    RBID
   |clé=     ISTEX:444D56D27EBF7681527E9F282D508A59D2646702
   |texte=   Heuristics for identification of bibliographic elements from title pages
}}

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024

	OCR exploration server
	Warning: this site is under development. It was generated automatically from raw corpora, so the information has not been validated.